Karma

[Screenshot: Karma alert dashboard showing node_exporter alerts with a silence comment "Silence this alert, it's always firing". Repo: https://github.com/prymitive/karma]


Legacy monitoring systems

[Meme: "The Boy Who Cried Wolf"]


• Monitoring Problems
• Legacy Infrastructure Service Discovery
• Prometheus Operator as the Solution
• PaaS Alerting and Dashboards - Monitoring As A Code
• Upgrades and Incidents
• Conclusion
• More than one monitoring system




Prometheus Service Discovery with Consul

[Diagram: HashiCorp Consul (discovery), Netdata (metrics agent), Prometheus]


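Prometheus Service Discovery with Consul

- job_name: 'consul'
  consul_sd_configs:
    - server: 'consul-server.consul:8500'
      datacenter: {{ consul_dc }}
      tags: ["monitoring"]
  relabel_configs:
    - source_labels: [__meta_consul_service]
      target_label: job
    - source_labels: [__meta_consul_node]
      target_label: instance

- job_name: netdata
  metrics_path: '/api/v1/allmetrics'
  params:
    format: [prometheus]
  consul_sd_configs:
    - server: 'consul-server.consul:8500'
      datacenter: {{ consul_dc }}
      services: ["netdata"]
  relabel_configs:
    - source_labels: [__meta_consul_service]
      target_label: job
    - source_labels: [__meta_consul_node]
      target_label: instance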


Kubernetes

[Meme: "What gives people feelings of power"]

Kubernetes cluster resources:

• Deployment, DaemonSet, ReplicaSet, StatefulSet, Job, ReplicationController
• HorizontalPodAutoscaler, VerticalPodAutoscaler, PodDisruptionBudget
• Container (your code)
• PersistentVolumeClaim, HostPath, EmptyDir


Kubernetes Custom Resource Definition

[Meme about YAML files]


Kubernetes Custom Resource Definition

A CRD is like a table definition; a CR is like a row in that table.

CRD
name: "Fruit"

Table: "Fruit"
(name)   "sweetness"   "weight"
apple    false         100
banana   true

CR
name: apple
sweetness: false
weight: 100
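
The Fruit analogy can be written as real manifests. The following is a minimal sketch, not taken from the talk: the API group example.com, the plural name fruits, and the field types (boolean, integer) are assumptions chosen for illustration.

```yaml
# CRD: defines the "table" (a new resource kind called Fruit).
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: fruits.example.com        # must be <plural>.<group>; group is hypothetical
spec:
  group: example.com
  scope: Namespaced
  names:
    kind: Fruit
    singular: fruit
    plural: fruits
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                sweetness:
                  type: boolean   # assumed type
                weight:
                  type: integer   # assumed type
---
# CR: one "row" in that table.
apiVersion: example.com/v1
kind: Fruit
metadata:
  name: apple
spec:
  sweetness: false
  weight: 100
```

Once the CRD is applied, the API server accepts Fruit objects like any built-in resource, and an operator can watch them.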


Kubernetes Operator

[Diagram: Kubernetes Operator loop with Create/Modify and Reconcile steps; reconcile period 60s]


Prometheus Operator - Architecture Overview

[Diagram: Operator]
ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: prometheus-operator-alertmanager
    chart: prometheus-operator-9.3.2
    heritage: Helm
    release: prometheus-operator
  name: prometheus-operator-alertmanager
  namespace: monitoring
spec:
  endpoints:
  - path: /metrics
    port: web
  namespaceSelector:
    matchNames:
    - monitoring
  selector:
    matchLabels:
      app: prometheus-operator-alertmanager
      release: prometheus-operator
ServiceMonitor

~ wsp.production % k get servicemonitors.monitoring.coreos.com
NAME                                          AGE
prometheus-operator-alertmanager              199d
prometheus-operator-apiserver                 199d
prometheus-operator-coredns                   199d
prometheus-operator-grafana                   199d
prometheus-operator-kube-controller-manager   199d
prometheus-operator-kube-etcd                 199d
prometheus-operator-kube-proxy                187d
prometheus-operator-kube-scheduler            199d
prometheus-operator-kube-state-metrics        199d
prometheus-operator-kubelet                   199d
prometheus-operator-node-exporter             199d
prometheus-operator-operator                  199d
prometheus-operator-prometheus                199d
ServiceMonitor

[Diagram: ns: openshift-monitoring | ns: my-app]

ServiceMonitor (creates config and exposes it to Prometheus):

kind: ServiceMonitor
metadata:
  labels:
    k8s-app: my-service-monitor
  name: my-service-monitor
spec:
  endpoints:
  - interval: 30s
    port: metrics
    scheme: http
  namespaceSelector:
    matchNames:
    - my-app
  selector:
    matchLabels:
      k8s-app: my-app

Service (matched by the ServiceMonitor selector; metrics are scraped from the pods behind it):

kind: Service
metadata:
  labels:
    k8s-app: my-app
  name: my-a
spec:
  ports:
  - name: metrics
    port: 8080
  selector:
    k8s-app: my-app

Legend:
• Elements already in place - deployed/managed by cluster-monitoring-operator
• Custom elements to be deployed by the user


Prometheus Operator - how it works

[Diagram]
• Operator manages the Prometheus config, Prometheus, and Alertmanager
• Prometheus scrapes Target Pods
• Prometheus sends alerts to Alertmanager
• Alertmanager sends alerts to Receivers
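
The relationships in the diagram are wired together by a Prometheus custom resource. The following is a minimal sketch under stated assumptions: the resource name, the service account, and the label used in the selectors are illustrative and not taken from the talk.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example                    # hypothetical name
  namespace: monitoring
spec:
  replicas: 1
  serviceAccountName: prometheus   # assumed to exist with suitable RBAC
  # Pick up every ServiceMonitor carrying this label (label value assumed).
  serviceMonitorSelector:
    matchLabels:
      release: prometheus-operator
  # Pick up every PrometheusRule carrying this label (label value assumed).
  ruleSelector:
    matchLabels:
      release: prometheus-operator
  # Where Prometheus sends alerts.
  alerting:
    alertmanagers:
      - namespace: monitoring
        name: alertmanager-operated   # governing Service created by the operator
        port: web
```

The operator watches this object together with the selected ServiceMonitor and PrometheusRule objects and regenerates the Prometheus configuration whenever any of them change.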


PrometheusRules

kind: PrometheusRule
metadata:
  annotations:
    prometheus-operator-validated: "true"
  labels:
    app: prometheus-operator
    chart: prometheus-operator-9.3.2
    heritage: Helm
    release: prometheus-operator
  name: prometheus-operator-etcd
  namespace: monitoring
spec:
  groups:
  - name: etcd
    rules:
    - alert: etcdInsufficientMembers
      annotations:
        message: 'etcd cluster "{{ $labels.job }}": insufficient members ({{ $value }}).'
      expr: sum(up{job=~".*etcd.*"} == bool 1) by (job) < ((count(up{job=~".*etcd.*"}) by (job) + 1) / 2)
      for: 3m
      labels:
        severity: critical


PrometheusRules - alerts out-of-the-box

~ wsp.production monitoring % k get prometheusrules.monitoring.coreos.com
NAME                                                       AGE
prometheus-operator-alertmanager                           199d
prometheus-operator-blackbox                               151d
prometheus-operator-cert-manager                           151d
prometheus-operator-etcd                                   199d
prometheus-operator-general                                199d
prometheus-operator-k8s                                    199d
prometheus-operator-kube-apiserver-error                   117d
prometheus-operator-kube-apiserver                         199d
prometheus-operator-kube-prometheus-node-recording         199d
prometheus-operator-kube-scheduler                         199d
prometheus-operator-kubernetes-absent                      117d
prometheus-operator-kubernetes-apps                        199d
prometheus-operator-kubernetes-resources                   199d
prometheus-operator-kubernetes-storage                     199d
prometheus-operator-kubernetes-system                      199d
prometheus-operator-kubernetes-system-apiserver            199d
prometheus-operator-kubernetes-system-controller-manager   199d
prometheus-operator-kubernetes-system-kubelet              199d
prometheus-operator-kubernetes-system-scheduler            199d


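Kubernetes-mixin - all-in-one k8s monitoring

monitoring % k get configmaps | grep prometheus-operator
NAME                                                    AGE
prometheus-operator-apiserver                           201d
prometheus-operator-cluster-total                       201d
prometheus-operator-controller-manager                  201d
prometheus-operator-etcd                                201d
prometheus-operator-grafana                             201d
prometheus-operator-grafana-config-dashboards           201d
prometheus-operator-grafana-datasource                  201d
prometheus-operator-grafana-test                        201d
prometheus-operator-k8s-coredns                         201d
prometheus-operator-k8s-resources-cluster               201d
prometheus-operator-k8s-resources-namespace             201d
prometheus-operator-k8s-resources-node                  201d
prometheus-operator-k8s-resources-pod                   201d
prometheus-operator-k8s-resources-workload              201d
prometheus-operator-k8s-resources-workloads-namespace   201d
prometheus-operator-kubelet                             201d
prometheus-operator-namespace-by-pod                    201d
prometheus-operator-namespace-by-workload               201d
prometheus-operator-node-cluster-rsrc-use               201d
prometheus-operator-node-rsrc-use                       201d
prometheus-operator-nodes                               201d
prometheus-operator-persistentvolumesusage              201d
prometheus-operator-pod-total                           201d
prometheus-operator-pods                                120d
prometheus-operator-prometheus                          201d
prometheus-operator-proxy                               189d
prometheus-operator-scheduler                           201d
prometheus-operator-statefulset                         201d
prometheus-operator-workload-total                      201d
prometheus-prometheus-operator-prometheus-rulefiles-0   40d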


Kubernetes-mixin - all-in-one k8s monitoring

[Screenshot: Grafana etcd dashboard. Panels: RPC Rate, Active Streams, DB Size, Disk Sync Duration, Client Traffic In, Client Traffic Out, Peer Traffic In, Peer Traffic Out, Raft Proposals, Total Leader Elections Per Day]


Kubernetes-mixin - all-in-one k8s monitoring

[Screenshot: Grafana "Kubernetes / Kubelet" dashboard. Panels: Up, Running Pods, Running Container, Actual Volume Count, Desired Volume Count, Config Error Count, Operation Rate, Operation Error Rate, Operation duration 99th quantile, Pod Start Rate, Pod Start Duration]


Longterm storage Overview 




Deploy Helm Chart via Ansible 



PROMETHEUS-OPERATOR 
VS 
KUBE-PROMETHEUS-STACK 


DEPRECATED


Further development has moved to prometheus-community/helm-charts. The chart has been renamed kube-prometheus-stack to more 
clearly reflect that it installs the kube-prometheus project stack, within which Prometheus Operator is only one component. 
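
Deploying the renamed chart from Ansible could look roughly like the task below. This is a sketch, not the speaker's playbook: it assumes the kubernetes.core collection is installed, and the release name, namespace, and the single values override are illustrative.

```yaml
- name: Deploy kube-prometheus-stack with Helm
  kubernetes.core.helm:
    name: kube-prometheus-stack          # release name (assumed)
    chart_ref: kube-prometheus-stack
    chart_repo_url: https://prometheus-community.github.io/helm-charts
    release_namespace: monitoring        # namespace assumed from the other slides
    create_namespace: true
    values:
      grafana:
        enabled: true                    # example override only
```

Pinning the chart with the module's chart_version parameter is usually worth considering, since chart upgrades can change the CRDs.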


PaaS Alerting and Dashboards - Monitoring As A Code

- name:
  k8s:
    definition:
      apiVersion:
      kind:
      metadata:
        labels:
          app:
          generated:
          release:
        name:
        namespace:
      spec:
        groups:
          - name:
            rules:
  loop:
  loop_control:
    label:
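
The slide shows only the keys of the Ansible task. A filled-in version might look like the sketch below. Every value here is an assumption made for illustration: the variable paas_rule_groups, the label values, and the namespace are hypothetical, and kubernetes.core.k8s is the current fully qualified name of the k8s module.

```yaml
- name: Create a PrometheusRule per rule group
  kubernetes.core.k8s:
    state: present
    definition:
      apiVersion: monitoring.coreos.com/v1
      kind: PrometheusRule
      metadata:
        labels:
          app: prometheus-operator          # assumed; must match the Prometheus ruleSelector
          generated: "true"                 # assumed value
          release: prometheus-operator      # assumed value
        name: "{{ item.name }}"
        namespace: monitoring
      spec:
        groups:
          - name: "{{ item.name }}"
            rules: "{{ item.rules }}"
  loop: "{{ paas_rule_groups }}"            # hypothetical list of {name, rules} items
  loop_control:
    label: "{{ item.name }}"                # keep task output short
```

Rule annotations that contain Prometheus templates such as {{ $labels.instance }} collide with Jinja2 syntax, so they have to be escaped or marked unsafe in the Ansible variables.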


PaaS Alerting and Dashboards - Monitoring As A Code

- alert: netdataMetricsUnreachable
  expr: netdata:reachability:bool ==
  for: 5m
  labels:
    severity: critical
    : ops
  annotations:
    : "{{ $labels.instance }} netdata metrics unreachable from prometheus for 5m."
    : troubleshooting/node.md#netdataMetricsUnreachable
    : 6bab49de
    : netdata:reachability:bool{{"{"}}instance="{{ $labels.instance }}"{{"}"}}
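
Several keys and the comparison value are not legible on the slide. A complete rule group built on the same alert might look like the sketch below; the threshold 0, the label name team, and the annotation names summary and runbook are assumptions, not the speaker's actual keys.

```yaml
groups:
  - name: netdata
    rules:
      - alert: netdataMetricsUnreachable
        expr: netdata:reachability:bool == 0   # threshold assumed
        for: 5m
        labels:
          severity: critical
          team: ops                            # label name assumed
        annotations:
          # annotation names below are assumed
          summary: "{{ $labels.instance }} netdata metrics unreachable from prometheus for 5m."
          runbook: troubleshooting/node.md#netdataMetricsUnreachable
```

This is the plain Prometheus form. When the same text passes through Ansible templating first, literal braces need escaping, which is what the {{"{"}} and {{"}"}} sequences in the slide do.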
PaaS Alerting and Dashboards - Monitoring As A Code

apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    : "true"
    : "true"
  namespace: monitoring
  name: grafana-dashboard-{{ (item.path | basename | splitext)[0] }}
data:
  {{ 'grafana-dashboard-%s' | format(item.path | basename) }}:
{{ lookup('file', item.path) | indent(6, first=True, blank=True) }}
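
After templating, each dashboard file becomes one ConfigMap. A rendered result might look like the sketch below, assuming a dashboard file named example.json and assuming the Grafana sidecar is configured to watch the label grafana_dashboard; the label names on the slide are not legible, so both are illustrative.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    grafana_dashboard: "true"    # assumed sidecar label
    generated: "true"            # assumed label
  namespace: monitoring
  name: grafana-dashboard-example
data:
  grafana-dashboard-example.json: |
    {
      "title": "Example",
      "panels": []
    }
```

With this layout, adding a dashboard is a matter of committing a JSON file to the repository; the playbook loop turns it into a ConfigMap and Grafana loads it without a restart.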
• etcd endpoints do not work out of the box
• A CRD change during an upgrade = failed upgrade
• ServiceMonitors are deleted when the Helm release is deleted
• Suboptimal ServiceMonitor parameters
• kube-apiserver dies at 4 GB when the Helm chart is deployed
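
For the etcd endpoints issue, one common approach is to list the etcd members explicitly in the chart values. The fragment below is a sketch for kube-prometheus-stack, not the fix used in the talk: the IP addresses, port, and certificate paths are placeholders and must be adapted to the cluster.

```yaml
kubeEtcd:
  enabled: true
  endpoints:                 # placeholder addresses of the etcd members
    - 10.0.0.1
    - 10.0.0.2
    - 10.0.0.3
  service:
    port: 2379               # assumed client port
    targetPort: 2379
  serviceMonitor:
    scheme: https
    insecureSkipVerify: false
    # Paths inside the Prometheus pod; they assume the etcd client
    # certificates are mounted from a Secret (hypothetical paths).
    caFile: /etc/prometheus/secrets/etcd-client-cert/ca.crt
    certFile: /etc/prometheus/secrets/etcd-client-cert/client.crt
    keyFile: /etc/prometheus/secrets/etcd-client-cert/client.key
```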
Conclusion

• "Admins" are no longer the bottleneck
• Developers deliver business metrics on their own
• The solution fits easily into any k8s pipeline
• It maps easily onto infrastructure as code

THE END